tle: “Lab 3: Association Rules” thor: “Che Cobb, Andy Heroy, Carson Drake, David Josephs” te: “2019-08-11” tput: html_document: toc: TRUE number_sections: true toc_float: collapsed: true smooth_scroll: false theme: paper df_print: paged keep_md: TRUE

Project Overview

Please we need to discuss project motivation and our experiment, just as Rida asked

Then we can begin our analysis

Project Setup

Business Understanding

  • [10 points] Describe the purpose of the data set you selected (i.e., why was this data bcollected in the first place?). How will you measure the effectiveness of a good algorithm? Why does your chosen validation method make sense for this specific dataset and the stakeholders needs?

We chose this dataset from the UCI’s machine learning repository for its categorical predictive attributes. It contains 1994 Census data pulled from the US Census database. The prediction task we’ve set forth is to predict if a person salary range is >50k in a 1994, based on the various categorical/numerical attributes in the census database. The link to the data source is below:

https://archive.ics.uci.edu/ml/datasets/census+income

The effectiveness of our algorithm will be determined by support, confidence and lift. As these are the metrics that describe how strong a relationship between each element is with the other elements within each transaction. Currently, there are no methods for cross validation of association rules, although there are some hard working individuals out there that are attempting to create such a tool.

  • TODO - ADD DEFINITION FOR LIFT/CONFIDENCE/SUPPORT HERE

Data Understanding

  • [10 points] Describe the meaning and type of data (scale, values, etc.) for each attribute in the data file. Verify data quality: Are there missing values? Duplicate data? Outliers? Are those mistakes? How do you deal with these problems?
  • [10 points] Visualize the any important attributes appropriately. Important: Provide an interpretation for any charts or graphs.

Here we will discuss each attribute and give some description about its ranges.

Categorical Attributes

  • workclass - Which business sector do they work in?
  • education - What level of education received?
  • marital_status - What is their marriage history
  • occupation - What do they do for a living
  • relationship - Family member relation
  • race - What is the subjects race
  • gender - What is the subjects gender
  • native_country - Where is the subject originally from
  • income_bracket - Do they make over or under 50k/year

Continuous Attributes

  • age - How old is the subject?
  • fnlwgt - Sampling weight of observation
  • education_num - numerical encoding of education variable
  • capital_gain - income from investment sources, separate from wages/salary
  • capital_loss - losses from investment sources, separate from wages/salary
  • hours_per_week - How many hours a week did they work?

Setup

First, lets go ahead and load up necessary libraries:

Loaded Packages:
   arules, arulesViz, forcats, dplyr, plotly, data.table,
   pander, knitr, skimr, lubridate, ggplot2, cowplot,
   foreach, doParallel

Next, lets import our dataset

data <- read.csv("./data/adult-training.csv")

Data Quality Inspection

The first thing we must do is check and see if there are any NAs in our dataset, just to make sure to not mess up our analysis.

NA_sum <- sort(sapply(data, function(x) sum(is.na(x))), decreasing = TRUE)
data.frame((NA_sum))

Data Cleaning

Looks like we are doing ok here. The next issue we have in the dataset, is because of the way the csv was stored, some of the levels in our factors include leading and trailing whitespace. This is highly undesirable, so we must clean it up:

GetFactors <- function(df) {
    return(names(Filter(is.factor, df)))
}
FixLevels <- function(x) {
    levels(x) <- trimws(levels(x))
    return(x)
}
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
  • workclass: ?, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov and Without-pay
  • education: 10th, 11th, 12th, 1st-4th, 5th-6th, 7th-8th, 9th, Assoc-acdm, Assoc-voc, Bachelors, Doctorate, HS-grad, Masters, Preschool, Prof-school and Some-college
  • marital_status: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated and Widowed
  • occupation: ?, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Other-service, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support and Transport-moving
  • relationship: Husband, Not-in-family, Other-relative, Own-child, Unmarried and Wife
  • race: Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other and White
  • gender: Female and Male
  • native_country: ?, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam and Yugoslavia
  • income_bracket: <=50K and >50K

Next, we need to reencode our data as factors. First, lets encode the education levels into factors with larger groups (for example 1st-12th grade should be no diploma, not a bunch of levels).

data$education <- fct_collapse(data$education, `No Diploma` = c("1st-4th", "5th-6th", 
    "7th-8th", "9th", "10th", "11th", "12th", "Preschool"), Associates = c("Assoc-acdm", 
    "Assoc-voc"), Diploma = c("Some-college", "HS-grad"))

Then the the income brackets:

data$income_bracket <- fct_collapse(data$income_bracket, small = "<=50K", large = ">50K")

Next, lets change the ? levels to something more useful:

levels(data$workclass)[levels(data$workclass) == "?"] <- "Other"
levels(data$occupation)[levels(data$occupation) == "?"] <- "Other-service"
levels(data$native_country)[levels(data$native_country) == "?"] <- "Other"
data[GetFactors(data)] <- lapply(data[GetFactors(data)], FixLevels)
pander(lapply(data[GetFactors(data)], levels))
  • workclass: Other, Federal-gov, Local-gov, Never-worked, Private, Self-emp-inc, Self-emp-not-inc, State-gov and Without-pay
  • education: No Diploma, Associates, Bachelors, Doctorate, Diploma, Masters and Prof-school
  • marital_status: Divorced, Married-AF-spouse, Married-civ-spouse, Married-spouse-absent, Never-married, Separated and Widowed
  • occupation: Other-service, Adm-clerical, Armed-Forces, Craft-repair, Exec-managerial, Farming-fishing, Handlers-cleaners, Machine-op-inspct, Priv-house-serv, Prof-specialty, Protective-serv, Sales, Tech-support and Transport-moving
  • relationship: Husband, Not-in-family, Other-relative, Own-child, Unmarried and Wife
  • race: Amer-Indian-Eskimo, Asian-Pac-Islander, Black, Other and White
  • gender: Female and Male
  • native_country: Other, Cambodia, Canada, China, Columbia, Cuba, Dominican-Republic, Ecuador, El-Salvador, England, France, Germany, Greece, Guatemala, Haiti, Holand-Netherlands, Honduras, Hong, Hungary, India, Iran, Ireland, Italy, Jamaica, Japan, Laos, Mexico, Nicaragua, Outlying-US(Guam-USVI-etc), Peru, Philippines, Poland, Portugal, Puerto-Rico, Scotland, South, Taiwan, Thailand, Trinadad&Tobago, United-States, Vietnam and Yugoslavia
  • income_bracket: small and large

Next, lets remove the fnlwgt, education number, and capital gain and loss columns, as they are unneeded. We also need to rename some columns to be easier for us, and use the cut function to factorize our numeric variables

data <- data[, -c(3, 5, 11:12)]
data$age <- cut(data$age, breaks = c(15, 25, 45, 65, 100), labels = c("Young", 
    "Middleaged", "Senior", "Retired"))
data$hours_per_week <- cut(data$hours_per_week, breaks = c(0, 20, 40, 60, 80), 
    labels = c("part-time", "full-time", "hard-working", "need-a-life"))
library(magrittr)
data %>% skim_to_list() %$% factor %>% select(-c("ordered", "missing", "complete")) %>% 
    kable
variable n n_unique top_counts
age 32561 4 Mid: 16523, Sen: 8469, You: 6411, Ret: 1158
education 32561 7 Dip: 17792, Bac: 5355, No : 4253, Ass: 2449
gender 32561 2 Mal: 21790, Fem: 10771, NA: 0
hours_per_week 32561 4 ful: 20052, har: 8471, par: 2928, nee: 902
income_bracket 32561 2 sma: 24720, lar: 7841, NA: 0
marital_status 32561 7 Mar: 14976, Nev: 10683, Div: 4443, Sep: 1025
native_country 32561 42 Uni: 29170, Mex: 643, Oth: 583, Phi: 198
occupation 32561 14 Oth: 5138, Pro: 4140, Cra: 4099, Exe: 4066
race 32561 5 Whi: 27816, Bla: 3124, Asi: 1039, Ame: 311
relationship 32561 6 Hus: 13193, Not: 8305, Own: 5068, Unm: 3446
workclass 32561 9 Pri: 22696, Sel: 2541, Loc: 2093, Oth: 1836

Lets see the results:

EDA

We’d also like to get a quick feel for the dataset through some visulizations.

p1 <- ggplot(data) + geom_bar(aes(x = age, fill = income_bracket), stat = "count") + 
    labs(x = "Age", y = "Count", title = "Age by Income", subtitle = "Histogram plot")
p1

Our first plot shows us a quick histogram of age groups and their income group. Not surprisingly, the majority of the “large” income group is in middleaged and senior groups supporting the common knowledge that as you get older, you should be making more money.

p2 <- ggplot(data, aes(x = education, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Education", 
    y = "Proportion", title = "Income bias based on Education", subtitle = "Stacked bar plot")
p2

Our next plot is a stacked bar chart of income group by density of education. We see here that higher education (Bachelors, Masters, Doctorate) yields a higher proportions of the “large” income bracket. Professional school also about equal with Docotorate suggesting specilaized trades and higher education are the best bet for making over 50k a year.

p3 <- ggplot(data, aes(x = marital_status, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Marital Status", 
    y = "Proportion", title = "Income bias based on Marital status", subtitle = "Stacked bar plot")
p3

Using the similar plot style as above, we’ll now look at income group by density of Marital Status. Which too no surprise to us researchers, married couples have a distinct advantage over thoes that are unmarried. In fact the next highest categories are Widowed and Divorced. Which means, if you want to make more money in life, you should at least give marriage a try.

p4 <- ggplot(data, aes(x = occupation, fill = income_bracket, color = income_bracket)) + 
    geom_bar(alpha = 0.9, position = "fill") + coord_flip() + labs(x = "Occupation Status", 
    y = "Proportion", title = "Income bias based on Occupation status", subtitle = "Stacked bar plot")
p4

Next, we’ll compare income groups by density of Occupation. This revealed a few obvious, but also a few interesting results. Exec-managerial and Prof-specialty were the highest categories to no one’s surprise. Protective-serv, transport-moving and tech-support also show good density for making over 50k. Some of the lowest categories thought were Armed-Forces, and house-cleaning/services.

p6 <- ggplot(data, aes(occupation)) + geom_bar(aes(fill = education), width = 0.5) + 
    theme(axis.text.x = element_text(angle = 60, vjust = 0.5)) + labs(title = "Histogram of occupation with education binning", 
    subtitle = "Occupation and Educational")
p6

Lastly, we combine education and ocupation to see if we can notice any particular relationships between the two groups. What kind of educations to the people with large incomes have? Exec-managerial show a large amount of bachelor’s and masters degree’s, which means, more school definitely helps in that occupation. Prof-specialty shows the best range of educational sectors respresented but we found it interesting that prof-school wasn’t a larger representation here. This is probably due to prof-specialty being a wider category than some in that you can be a professional in alot of different occupational sectors.

Armed-forces have an extremely low amount of education information which also makes sense as most are recruited right out of high school. We would have expected to see a higher amount of Diploma’s given out in this group, but perhaps the information is just missing from this dataset.

Finally, we can set up our dataset as a transactional dataset to be in the proper data format for the Apriori algorithm:

data <- as(data, "transactions")
summary(data)
#> transactions as itemMatrix in sparse format with
#>  32561 rows (elements/itemsets/transactions) and
#>  102 columns (items) and a density of 0.1077805 
#> 
#> most frequent items:
#> native_country=United-States                   race=White 
#>                        29170                        27816 
#>         income_bracket=small            workclass=Private 
#>                        24720                        22696 
#>                  gender=Male                      (Other) 
#>                        21790                       231771 
#> 
#> element (itemset/transaction) length distribution:
#> sizes
#>    10    11 
#>   208 32353 
#> 
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.00   11.00   11.00   10.99   11.00   11.00 
#> 
#> includes extended item information - examples:
#>           labels variables     levels
#> 1      age=Young       age      Young
#> 2 age=Middleaged       age Middleaged
#> 3     age=Senior       age     Senior
#> 
#> includes extended transaction information - examples:
#>   transactionID
#> 1             1
#> 2             2
#> 3             3

Modeling and Evaluation

  • Different tasks will require different evaluation methods. Be as thorough as possible when analyzing the data you have chosen and use visualizations of the results to explain the performance and expected outcomes whenever possible. Guide the reader through your analysis with plenty of discussion of the results. For this task, we chose Option B: Association Rule Mining.

Option B: Association Rule Mining • Create frequent itemsets and association rules. • Use tables/visualization to discuss the found results. • Use several measure for evaluating how interesting different rules are. • Describe your results. What findings are the most compelling and why?

Before we begin with our analysis, lets check out the rule frequencies within the dataset. We are looking for rules with support >= .2

itemFrequencyPlot(data, support = 0.2)

Rule mining

Next, lets mine some rules with the apriori algorithm, and then clean up redundant rules. We are still sorting out what to set the minsupp and minconf to.

zerules <- apriori(data, parameter = list(minlen = 2, supp = 0.2, conf = 0.15), 
    appearance = list(rhs = c("income_bracket=small", "income_bracket=large"), 
        default = "lhs"), control = list(verbose = F))
length(zerules)
#> [1] 91
redundant <- is.redundant(zerules)
zerules.pruned <- zerules[redundant == FALSE]
rulesorted <- sort(zerules.pruned, by = "lift", decreasing = TRUE)
length(rulesorted)
#> [1] 25

Rule quality and inspection

Next, let us inspect the rules, and examine their quality

(quality(rulesorted))
inspectDT(rulesorted, caption = "Association Rules")

Plots

Scatterplot

plot(rulesorted, method = "scatterplot", measure = c("support", "lift"), shading = "confidence", 
    engine = "htmlwidget")

Scatter Plot

Discussion

This plot shows us each rule, with x axis as support and Y axis as lift. They are colored by confidence. This makes a nice little view of which rules were made from the most data. These may not always be the most interesting rules, but they are definitely the strongest ones. From looking at this, we can note a few strong, interesting associations.

  • Females are associated with low income bracket

    • Conversely, being a male is (with low confidence) associated with a high income bracket.
  • Not being married is strongly associated with a low income bracket

  • High school diploma only is associated with low income

  • Private/housekeeping/server jobs are associated with low income

  • No family is also associated with low income

These strong rules are by no means earth-shattering, but they do tell us that we are looking in the right direction, as they agree with the (sometimes unfortunate) realities of the world we live in.

Balloon plot

plot(rulesorted, method = "graph", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")

Balloon Plot

Discussion

The balloon plot is able to filter and zoom, which gives the ability to look through each income group and see what rules are feeding the analysis. Just by seperating out the large income bracket, we begin to see that married-civ-spouce has one of the highest lift scores. The next two highest were white and Male. Which follows suit with our above research as well.

In the small income bracket, we see rules such as (Never-married, fulltime), (female, private/housekeeper/server), (Private/housekeeper/server, never-married). This would begins to tell us that if you’re female, never-married and work full time, then you’re chances are higher for making under 50k. It also says something negative

Parallel plot

plot(rulesorted, method = "paracoord", measure = "confidence", shading = "lift", 
    control = list(reorder = T))
Parallel Plot

Parallel Plot

Discussion

The width of the arrows represents support and the color represents confidence. We can see a few interesting things here. First of all, we have a striking level of confidence that being married gets you a large income, and then also interestingly we see that working as a private worker/housekeeper/server is a part of a lot of the rules related to being in the low income bracket (You can see this by looking where all the rules converge on LHS order 1).

Two key plot

This plot is often used to discuss the relationship between the order of the rules and their support/confidence.

plot(rulesorted, method = "two-key plot", measure = "lift", shading = "confidence", 
    engine = "htmlwidget")

Two Key Plot

Discussion

We can find several interesting bits of information here. First, we see that we only produced one rule of order 4, a few rules of order 3, and a bunch of order 2. We also note that the area with high support and confidence (upper right, a bit of a sweet spot), is dominated by the shorter rules.

Alternate Rule Mining

After our first attempt at mining for rules, we’re going to relax our support to 0.1 and increase our confidence to 98% see what we get. Aftewards, we’ll target small and large incomes individually in order to see if we can isolate the itemsets that are going to be of highest influence.

rule2 <- apriori(data, parameter = list(minlen = 2, supp = 0.1, conf = 0.98), 
    appearance = list(rhs = c("income_bracket=small", "income_bracket=large"), 
        default = "lhs"), control = list(verbose = F))
length(rule2)
#> [1] 50
redundant <- is.redundant(rule2)
rulep <- rule2[redundant == FALSE]
rulesorted2 <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted2)
#> [1] 36

We can see that this time around we’ve generated 36 rule’s. We’ll put them through the same inspection and plots as previously to see if any other rules stand out or if we’re seeing more of the same.

Inspection

head(quality(rulesorted2))
inspectDT(rulesorted2, caption = "Alternate Association Rules")

Plots

Scatterplot

plot(rulesorted2, method = "scatterplot", measure = c("support", "lift"), shading = "confidence", 
    engine = "htmlwidget")

Discussion

We see similar results as in section 8.2.1.

  • Never being married has a strong association with low income.

  • High school Diploma is a lower income indicator

  • Private jobs also carry an association with low income.

  • One difference was that Female was NOT found to carry an association with low income within these rules. Why, we are unsure.

But we also noticed that two other itemsets began to surface around the highest lift.

  • Young was seen more often within the higher threshold lift rules, meaning yes, young people don’t make as much money.

  • Owning a child also had a strong association with low income.

What was also interesting to see is that one of the rules with the highest lift, was (Young, never-married, own-child). Which fits the tragic tale of the struggling single parent within our society.

Balloon plot

plot(rulesorted2, method = "graph", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")

Discussion

Here the large income bracket didn’t appear in our graph, which validates why we didn’t see any large income factors in the scatter plot. This chart does show us similiarities between what we saw in small income. That some of the highest itemsets for low income were (Young, never-married, own-child, Diploma, Private, White, United-States). The strongest (Rule 1) being (Young, never-married, own-child)

Two key plot

plot(rulesorted2, method = "two-key plot", measure = "confidence", shading = "lift", 
    engine = "htmlwidget")

Discussion

Here we see our parameter expanding show differeng itemset levels. With the higher confidence level, we’ve gotten rid of more of the two quantity itemsets, but increased the number of higher order-itemsets. For instance, We now have 5 5-item rulesets with different combinations of all the factors we saw in the previous plot. (Young, never-married, own-child, Diploma, Private, White, United-States).

Targeting Small and Large Income

Small Income

Now that we’ve scanned a couple of different variations of apriori and generated rules accordingly. Lets split up the targeting in the rhs section to only target one of the income brackets at a time. First we’ll start with low income.

rule_small <- apriori(data, parameter = list(minlen = 2, supp = 0.12, conf = 0.95), 
    appearance = list(rhs = c("income_bracket=small"), default = "lhs"), control = list(verbose = F))
length(rule_small)
#> [1] 45
redundant <- is.redundant(rule_small)
rulep <- rule_small[redundant == FALSE]
rulesorted_small <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_small)
#> [1] 25
head(quality(rulesorted_small))
inspectDT(rulesorted_small)

It looks like we’ve generated 25 rules for the low income bracket. Lets take a look at the scatter plot and find our releveant itemsets/rules

plot(rulesorted_small, method = "scatterplot", measure = c("confidence", "support"), 
    shading = "lift", engine = "htmlwidget")

We will discuss the results from small income targeting in the conclusion section.

Large Income

Our next step is to now target the large income bracket and see what rules shake out

rule_large <- apriori(data, parameter = list(minlen = 2, supp = 0.05, conf = 0.55), 
    appearance = list(rhs = c("income_bracket=large"), default = "lhs"), control = list(verbose = F))
length(rule_large)
#> [1] 53
redundant <- is.redundant(rule_large)
rulep <- rule_large[redundant == FALSE]
rulesorted_large <- sort(rulep, by = "lift", decreasing = TRUE)
length(rulesorted_large)
#> [1] 15
head(quality(rulesorted_large))
inspectDT(rulesorted_large)

It looks like we’ve generated 25 rules for the low income bracket. Lets take a look at the scatter plot and find our releveant itemsets/rules

plot(rulesorted_large, method = "scatterplot", measure = c("confidence", "support"), 
    shading = "lift", engine = "htmlwidget")

We will discuss the results from large income targeting in the conclusion section.

Conclusion

Small Income

Interesting small Income rules:

  • Not being married is associated with a low income. Out of our 25 best rules, 16 of them involved not being married, and our five highest lift rules all involved never married. This is clearly not a coincidence, however what this means and the cause of this is not so clear. We propose a few hypotheses about this:

    • Married people, motivated by love, work harder and make more money

    • Shallow society deems people without money harder to marry

  • People who are married got married because they are more social, and thus they are given more raises, perform better in interviews, etc. In contrast, people who were not sociable enough to meet someone may not perform as well in interviews etc., giving them fewer opportunities to get a job

  • Having just a diploma from high school does not get you out of the lower income bracket. This is unsurprising, as today you need in general a minimum of a college education to start making good money (and even that is difficult). 5 out of our 25 rules involved highest education level = high school diploma

  • 12 out of 25 rules involved being young. This makes perfect sense, as entry level jobs which young people get do not pay very well.

  • 8 out of 25 rules involved being in the private /housekeeping/server field. This makes sense, as these workers are severly underpaid

  • 4 out of 25 rules involved having a child. If you have a child, you should be spending more time with them, and maybe focusing a little less on your career. Therefore, you will probably earn less. This pattern makes sense.

  • The rules for our small income target have a support of 12% and confidence of 95%. This means that (never-married, Diploma, Young, Private/housekeeper/server, and child) are observed in 12% of the dataset. We therefore believe that 95% of the time any of the attributes (never-married, Diploma, Young, Private/housekeeper/server, and child) are present, it will result in a small income.

Large Income

Interesting large Income rules:

  • Thankfully, opposite to the small income analysis, large income targeting shows that being married greatly influences your income as it shows up in 9 of the 15 rules present. Meaning again that by working as a team, you get more out of life. So go find yourself a husband/wife.

  • The next highest frequent itemset is hard-working that shows up in 8 of the 15 rules. Which makes perfect sense that the more time you put into a job, the likelier you are for promotion and advancement within an organization.

  • Education = Bachelors , native_country = United-States, relationship = husband are present in 5 out of the 15 rules.

  • The rules for our large income target have a support of 05% and confidence of 55%. This means that (married, hard-working, Bachelors, United-States, Husband) are observed in 5% of the dataset. We therefore believe that 55% of the time any of the attributes (married, hard-working, Bachelors, United-States, Husband) are present, it will result in a large income.

Deployment

Be critical of your performance and tell the reader how you current model might be usable by other parties. Did you achieve your goals? If not, can you reign in the utility of your modeling?

• How useful is your model for interested parties (i.e., the companies or organizations that might want to use it)? • How would your deploy your model for interested parties? • What other data should be collected? • How often would the model need to be updated, etc.?

Exceptional Work

For our exceptional work, we decided to try and optimize the number of rules produced in our apriori algorithm with the RHS being both low and high income. We decided to further optimize it by finding the ruleset which would provide us the highest average lift, while being digestible. To do that, we implemented a fully parallelized grid search.

Building a Grid

There are two parameters to optimize in the apriori algorithm: confidence and support. Normally, these are chosen using an in depth knowledge of the data combined with an understanding of the goals of the study. Because this research is purely academic, and we are new to the apriori algorithm, we do not have a prior understanding to tell us what support and confidence will be the most useful for us. So, we must optimize both. We will begin by creating two objects, conf and supp, for confidence and support respectively. Each will be vectors from 0.01 to 0.99:

x <- 1:99
conf <- supp <- x/100

Next, we will create a data frame containing all possible combinations of support and confidence:

init <- data.frame(conf = conf, supp = supp)
params <- expand.grid(init)
nrow(params)
#> [1] 9801

Defining the Search Function

Next, we would like to create a function which we can iterate through. We would like the function to take in a value of support and confidence, maybe a value of appearance (but with defaults for our specific case) and run it through the apriori algorithm. Then, we would like it to prune redundant results from the rule, get the length and average lift of the ruleset, and return those in a convenient labeled vector:

search <- function(sup, con, appearance = list(rhs = c("income_bracket=small", 
    "\n                                            income_bracket=large"), default = "lhs")) {
    rules <- apriori(transactions, parameter = list(minlen = 2, supp = sup, 
        conf = con), control = list(verbose = f))
    redundant <- is.redundant(rules)
    rulepruned <- rules[redundant == false]
    rulelength <- length(rules)
    return(c(support = sup, conf = con, nrules = rulelength, avlift = mean(quality(rulepruned)$lift)))
}

Viewing the Results

We now must define what it means to be digestible. In my opinion, between 20 and 25 rules is clearly digestible to anyone. So we will find the rules created that give us that, and we want to also find the rulesets with the highest lift. We will do so using dplyr:

digest <- results %>% filter(nrules <= 25 & nrules >= 20) %>% filter(conf > 
    0.35) %>% filter(support > 0.35) %>% arrange(desc(avlift, nrules, support, 
    conf))
digest

Looks like we have a clear winner. We will now try